Abstract

In this article, we present an effective encoding of dendrograms by embedding
them into the Bruhat-Tits trees associated to p-adic number fields. As an application,
we show how strings over a finite alphabet can be encoded in cyclotomic
extensions of $\mathbb{Q}_p$ and discuss p-adic DNA encoding. The application leads to fast
p-adic agglomerative hierarchic algorithms similar to the ones recently used e.g. by
A. Khrennikov and others. From the viewpoint of p-adic geometry, to encode a
dendrogram $X$ in a p-adic field $K$ means to fix a set $S$ of $K$-rational punctures on
the p-adic projective line $\mathbb{P}^1$. To $\mathbb{P}^1 \setminus S$ is associated in a natural way a subtree
inside the Bruhat-Tits tree which recovers $X$, a method first used by F. Kato in
1999 in the classification of discrete subgroups of $\mathrm{PGL}_2(K)$.

Next, we show how the p-adic moduli space $M_{0,n}$ of $\mathbb{P}^1$ with $n$ punctures can
be applied to the study of time series of dendrograms and those symmetries arising
from hyperbolic actions on $\mathbb{P}^1$. In this way, we can associate to certain classes
of dynamical systems a Mumford curve, i.e. a p-adic algebraic curve with totally
degenerate reduction modulo p.

Finally, we indicate some of our results in the study of general discrete actions
on $\mathbb{P}^1$, and their relation to p-adic Hurwitz spaces.

1 Introduction 

Mumford curves arise as the generalisation of the so-called Tate uniformisation of p-adic
elliptic curves [13, §6]. The latter has a combinatorial description as a $\mathbb{Z}$-action on the
real line "connecting" the points $0$ and $\infty$ over $\mathbb{Q}_p$. The crucial idea by Mumford [10]
was to view the real line as a geodesic line inside the Bruhat-Tits tree $\mathcal{T}_{\mathbb{Q}_p}$ for $\mathrm{PGL}_2(\mathbb{Q}_p)$
and to consider a discrete action of a subgroup $G$ generated by $g$ hyperbolic fractional
linear transformations acting regularly on a subdomain $\Omega$ of the p-adic Riemann sphere
$\mathbb{P}^1$. It turns out that the orbit space $X = \Omega/G$ is a complete algebraic curve of genus
$g$, and that not all p-adic algebraic curves admit such a uniformisation. A curve of the
form $X = \Omega/G$ as above is called a Mumford curve or a p-adic Riemann surface.

Here, we are concerned with the application of p-adic geometry in the analysis of
hierarchical data. From a geometric viewpoint, the tree $\mathcal{T}_{\mathbb{Q}_p}$ represents the hierarchical
organisation of all p-adic numbers, including $\infty$. Namely, a p-adic number can in a
natural way be viewed as an infinite path inside the tree $\mathcal{T}_{\mathbb{Q}_p}$ starting from some vertex
$v$. Two paths starting from $v$ correspond to two p-adic numbers whose p-adic expansions
coincide in their first terms. The more terms they have in common, the closer
they are p-adically. Hence, for some given p-adic numbers, their geodesic paths in
$\mathcal{T}_{\mathbb{Q}_p}$ will yield a subtree which hierarchically represents their proximities. This motivates
the usefulness of the Bruhat-Tits tree for hierarchical data analysis, provided one finds a way of
encoding data as p-adic numbers. Unfortunately, there is no natural way of doing this
for arbitrary data other than strings over an alphabet.

Time series of hierarchical data naturally yield the consideration of families of sets
of p-adic numbers. The corresponding geometric construct is a moduli space of such
families. Here, they come in the form of $M_{0,n}$, the p-adic moduli space of n-pointed genus
zero curves. Classically, these and their variants in higher genus play an important role in
string theory, and we expect this also to be the case in p-adic string theory. However, for
data mining, a time series is simply a sequence of points in $M_{0,n}$, and it would certainly
be interesting to be able to interpolate and have a curve inside the moduli space in order
to say something about the evolution of the time series, or the probability of a certain
behaviour in time.

2 Dendrograms 

Dendrograms are a certain way of depicting trees arising in the hierarchical classification
of data. Their intention is usually to describe hierarchies found within some given dataset.
However, they are often the result of imposing a hierarchy onto the data, depending on the
choice of a metric. A lot of work by Fionn Murtagh aims to find ultrametricity in data
in order to reveal underlying hierarchy, e.g. [11, 12]. The reason is precisely the tree-like
structure of any ultrametric distance. From a p-adic viewpoint, the following procedure
seems natural:

1. Encode the dataset $X = \{x_1, \dots, x_n\}$ by p-adic numbers $Y$.

2. Construct the dendrogram for $X$ from the code $Y$.

The dendrogram for $X$ is uniquely determined by $Y$ and can be computed quite
fast. Hence, the true problem is to find a suitable encoding by p-adic numbers. This is in
general a very difficult task, as one is likely to need the dendrogram a priori. However,
for strings of letters from a given alphabet, we will show how p-adic encodings can be
effected in Section 5.

A more precise definition of a dendrogram is that of a metrised tree with finitely 
many ends, all of which are labelled. 

In what follows, we assume that to each dataset X, there is given a dendrogram D 
which is supposed to reveal the hierarchical structure within X. 



3 p-adic dendrograms 
Figure 1: A 2-adic dendrogram.

Consider the dendrogram $D$ as depicted in Figure 1. If one goes down from $\infty$ along
a path in $D$ to some datum $x = x_i$, and picks up the labels 0 or 1 along the way, then
one gets a 2-adic encoding

$$x = \sum_{\text{level } \nu} a_\nu 2^\nu \in \mathbb{Q}_2,$$

where the coefficient $a_\nu$ is the number picked up at level $\nu$. Here, this yields the numbers

$$x_1 = 0, \quad x_2 = 2^6, \quad x_3 = 2^5, \quad x_4 = 2^2,$$

$$x_5 = 2^2 + 2^4, \quad x_6 = 2^2 + 2^3, \quad x_7 = 2^0 + 2^1, \quad x_8 = 1.$$

Notice that the labels are arranged here in such a way that the code will be $x_1 = 0$
at the very left, and $x_n = 1$ at the very right (and $\infty$ on top) of the dendrogram. Of
course, the procedure yields just finite 2-adic expansions of rational numbers. But in this
way, the whole dendrogram $D$ gets embedded into an infinite tree: the Bruhat-Tits tree
$\mathcal{T}_{\mathbb{Q}_2}$ for the group $\mathrm{PGL}_2(\mathbb{Q}_2)$. For a general prime number $p$, the tree $\mathcal{T}_{\mathbb{Q}_p}$ is a locally
finite $(p+1)$-regular tree. The latter means that from each vertex there are precisely $p + 1$
emanating edges. The reason is that the vertices can be interpreted as p-adic discs, and
the edges are given by maximal non-trivial inclusion of discs. It is known that each disc
has precisely $p$ maximal proper subdiscs and lies inside precisely one minimal bigger
disc. Hence, each vertex has precisely $p$ children vertices and one parent vertex. This is
illustrated in Figure 2.
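To make the encoding read off from Figure 1 concrete, the following minimal sketch computes the pairwise 2-adic distances of the eight codes listed above. The function names `valuation` and `padic_distance` are our own illustrative choices and not part of any library.

```python
from fractions import Fraction
from itertools import combinations


def valuation(n: int, p: int) -> int:
    """p-adic valuation of a non-zero integer n: the largest v with p**v dividing n."""
    if n == 0:
        raise ValueError("the valuation of 0 is infinite")
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v


def padic_distance(x: int, y: int, p: int) -> Fraction:
    """|x - y|_p = p**(-v_p(x - y)); equal points have distance 0."""
    if x == y:
        return Fraction(0)
    return Fraction(1, p ** valuation(x - y, p))


# The 2-adic codes x_1, ..., x_8 of the data in Figure 1.
codes = [0, 2**6, 2**5, 2**2, 2**2 + 2**4, 2**2 + 2**3, 2**0 + 2**1, 1]

for (i, x), (j, y) in combinations(enumerate(codes, start=1), 2):
    print(f"|x_{i} - x_{j}|_2 = {padic_distance(x, y, 2)}")
```

Data that share a long initial segment of their 2-adic expansions, such as $x_4$ and $x_5$, come out close to each other, which is exactly the proximity displayed by the dendrogram.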

The number $p$ of children vertices comes from the isomorphism $\mathbb{Z}_p/p\mathbb{Z}_p \cong \mathbb{F}_p$, saying
that the residue field of $\mathbb{Q}_p$ is the finite field $\mathbb{F}_p$. Hence, each downward edge can be
labelled by any representative for $\mathbb{F}_p$ in the ring $\mathbb{Z}_p$ of p-adic integers. Quite common is
e.g. the set of labels $\{0, \dots, p-1\}$.

Figure 2: Local structure of $\mathcal{T}_{\mathbb{Q}_p}$.

Moving downwards from some vertex $v$ on $\mathcal{T}_{\mathbb{Q}_p}$ will end in some p-adic number $x \in \mathbb{Q}_p$,
obtained as the intersection of a decreasing sequence of discs corresponding to the vertices on the
infinite path; picking up labels as before yields a p-adic expansion of $x$ as a Laurent
series in $p$. Hence, all of $\mathbb{Q}_p$ can be considered as lying at the boundary of $\mathcal{T}_{\mathbb{Q}_p}$. However,
there is one more boundary point of $\mathcal{T}_{\mathbb{Q}_p}$ outside $\mathbb{Q}_p$: taking the path going upwards
from each vertex will lead to the point $\infty$. Hence, we have found

$$\partial\mathcal{T}_{\mathbb{Q}_p} = \mathbb{Q}_p \cup \{\infty\} = \mathbb{P}^1(\mathbb{Q}_p),$$

where the latter space $\mathbb{P}^1$ is the p-adic projective line.
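The expansion of a number as a Laurent series in $p$ can be computed digit by digit. The sketch below does this for rational numbers, using the common label set $\{0, \dots, p-1\}$ and assuming $p$ is prime; the name `padic_digits` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction


def padic_digits(x, p: int, n_digits: int):
    """Return (m, digits) with x = sum_k digits[k] * p**(m + k), truncated
    after n_digits terms, where every digit lies in {0, ..., p-1}."""
    x = Fraction(x)
    if x == 0:
        return 0, [0] * n_digits
    # Write x = p**m * u with u a p-adic unit.
    m = 0
    while x.numerator % p == 0:
        x /= p
        m += 1
    while x.denominator % p == 0:
        x *= p
        m -= 1
    digits = []
    for _ in range(n_digits):
        # Residue of the current value modulo p (the denominator is invertible mod p).
        d = (x.numerator * pow(x.denominator, -1, p)) % p
        digits.append(d)
        # Move one level down the tree: subtract the label and divide by p.
        x = (x - d) / p
    return m, digits


print(padic_digits(12, 2, 3))               # 12 = 2**2 + 2**3
print(padic_digits(-1, 2, 4))               # -1 = 1 + 2 + 4 + 8 + ...
print(padic_digits(Fraction(1, 3), 2, 6))   # 1/3 = 1 + 2 + 8 + 32 + ...
```

Each pass through the loop corresponds to following one downward edge of the tree and picking up its label.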

We have seen even more: the local picture of the Bruhat-Tits tree allows a local
interpretation as another projective line. Namely, there is a bijection

$$\{\text{edges emanating from vertex } v\} = \mathbb{F}_p \cup \{\infty\} = \mathbb{P}^1(\mathbb{F}_p),$$

with the projective line defined this time over the residue field $\mathbb{F}_p$.

Let now a finite set $X = \{x_0, \dots, x_n\}$ of p-adic numbers be given. Then by taking
inside the Bruhat-Tits tree $\mathcal{T}_{\mathbb{Q}_p}$ all geodesic paths between the points of $X$, one obtains a
subtree $\mathcal{T}(X)$ which we call a p-adic dendrogram. We give credit to Fumiharu Kato who
used this construction already in 1999 in the classification of p-adic discrete projective
linear groups (cf. also §5.2]).

Observe that the 2-adic encoding of a dendrogram described above yields a 2-adic
dendrogram for the 2-adically coded data plus the extra "datum" $\infty$. This extra point
at infinity allows one to determine the root of a dendrogram, from which all paths to genuine
data are oriented downwards, i.e. passing through children vertices. In Figure 1, the root
corresponds to the unit disc: on the one hand, the data code contains $0, 1 \in \mathbb{Q}_2$,
and the three numbers $0, 1, \infty$ uniquely determine the unit disc. On the other hand,
all data are encoded by numbers within the unit disc.
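For data encoded by non-negative integers, the p-adic dendrogram with root determined by $\infty$ can be sketched as a nested structure obtained by grouping the points digit by digit. The following is a minimal illustration under that assumption; the function `dendrogram` and its output format are our own choices, not a construction taken from the literature cited here.

```python
def dendrogram(points, p, level=0):
    """Nested clusters of distinct non-negative integers for the p-adic metric.

    A singleton is returned as the point itself. Otherwise the result is a
    pair (level, children), where `level` is the first digit position at which
    the points disagree; the branch vertex is the disc of radius p**(-level).
    """
    points = sorted(set(points))
    if len(points) == 1:
        return points[0]
    # Skip digit positions on which all points agree: no branching occurs there.
    while len({(x // p**level) % p for x in points}) == 1:
        level += 1
    groups = {}
    for x in points:
        groups.setdefault((x // p**level) % p, []).append(x)
    children = [dendrogram(g, p, level + 1) for _, g in sorted(groups.items())]
    return (level, children)


codes = [0, 2**6, 2**5, 2**2, 2**2 + 2**4, 2**2 + 2**3, 2**0 + 2**1, 1]
print(dendrogram(codes, 2))
```

For the codes of Figure 1 the top-level entry has level 0, i.e. the root is the disc of radius $2^0 = 1$, in agreement with the observation that the root corresponds to the unit disc.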



4 Non-binary data 



In the previous section, we have seen how to p-adically encode data having a binary 
dendrogram, and we have defined p-adic dendrograms which are not necessarily binary. 
Hence, a natural way of encoding data whose dendrogram D is not binary would be by 
increasing the prime number p to the size of at least the maximal number of children 
vertices in D. 

However, there is an alternative way of doing this without changing the prime $p$.
Namely, consider any finite field extension $K$ of $\mathbb{Q}_p$. It is well-known that the p-adic
norm extends uniquely to a norm $|\cdot|_K$, and that $K$ is complete for this norm. Again, the
unit disc is the ring $O_K = \{x \in K \mid |x|_K \le 1\}$, and the next smaller disc containing $0$ is
$\pi O_K$, where $\pi$ is a so-called uniformiser and plays the role of the prime $p$ in $K$. It holds
true that the residue field

$$k := O_K/\pi O_K \cong \mathbb{F}_q$$

is a finite field extension of $\mathbb{F}_p$ with $q = p^f$ elements for some natural number $f \ge 1$.
In general, it holds true that

$$f = \dim_{\mathbb{F}_p}(k) \le \dim_{\mathbb{Q}_p}(K) =: n, \qquad (1)$$

where the dimensions are those of vector spaces over the scalar fields $\mathbb{F}_p$ and $\mathbb{Q}_p$,
respectively. The result is that there are more discs defined over $K$ than over $\mathbb{Q}_p$. More
precisely, the number of "children" discs has increased to $q = p^f$, and there is a new
Bruhat-Tits tree $\mathcal{T}_K$ which is again infinite, but this time $(q+1)$-regular. The analogue
holds true:

$$\{\text{edges emanating from vertex } v\} = \mathbb{P}^1(k).$$

In fact, there is an embedding of trees

$$\mathcal{T}_{\mathbb{Q}_p} \to \mathcal{T}_K$$

given in general by subdividing edges and increasing the number of edges emanating
from a vertex. Note that the subdivision of edges comes from a relation between
the uniformisers:

$$|\pi|_K^e = |p|_p,$$

which causes the length of an edge in $\mathcal{T}_K$ to be an $e$-th fraction of an edge length in $\mathcal{T}_{\mathbb{Q}_p}$.
The number $e$ is called the ramification index of the field extension $K/\mathbb{Q}_p$. By adopting
the labelling method from above, we obtain the general encoding

$$x = \sum_{\nu=-m}^{\infty} a_\nu \pi^\nu,$$

where $a_\nu$ is taken from a system $\mathcal{R}$ of representatives in $K$ for the residue field $k$. This
is nothing but the $\pi$-adic expansion of elements from $K$.

In the case that (1) is an equality, the field extension $K/\mathbb{Q}_p$ is called unramified. By
the well known formula

$$n = e \cdot f,$$

this is equivalent to $e = 1$. In this case, the prime $p$ can be taken as the uniformiser of $K$,
and we obtain again p-adic expansions, only with more choice of coefficients. A special
case is given by a so-called cyclotomic extension $K = \mathbb{Q}_p(\zeta)$ obtained by adjoining to $\mathbb{Q}_p$
the powers of a primitive $(p^f - 1)$-th root $\zeta$ of unity. This case is known to be unramified,
and we can take as coefficients

$$\mathcal{R}_f = \{0, 1, \zeta, \dots, \zeta^{p^f - 2}\} \qquad (2)$$

for the p-adic expansion of elements from $K$. Note that for $f = 1$, this yields a set of
coefficients different from the usual choice $\{0, \dots, p-1\}$.

The proofs for most of the statements in this section can be found in [7, Ch. 5].

5 Strings over an alphabet

Let $\mathcal{A}$ be a finite alphabet. We will show how to realise p-adic encodings of strings over
$\mathcal{A}$.

First, denote by $S(\mathcal{A})$ the set of all strings with letters from $\mathcal{A}$. The subset of finite
strings will be denoted by $S_{\mathrm{fin}}(\mathcal{A})$. Now, for $f$ sufficiently large, any injective map

$$\mathcal{A} \to \mathcal{R}_f,$$

with $\mathcal{R}_f$ defined as in (2), induces an encoding of $S(\mathcal{A})$ in $O_K$, where $K$ is the cyclotomic
field $\mathbb{Q}_p(\zeta)$ with $\zeta$ a primitive $(p^f - 1)$-th root of unity. Clearly, the finite strings are then
in bijection with the set of polynomials $\mathcal{R}_f[p]$ in the prime $p$ whose coefficients are from
$\mathcal{R}_f$. We even have more [2, Thm. 3.1]:

Theorem 5.1 There exists a cyclotomic field $K = \mathbb{Q}_p(\zeta)$ with $\zeta$ as above, and a closed
isometric embedding $\phi: S(\mathcal{A}) \to O_K$ such that $\phi(S_{\mathrm{fin}}(\mathcal{A})) \subset \mathcal{R}_f[p]$, and is dense in
$\phi(S(\mathcal{A}))$.

Here, the metric on $S(\mathcal{A})$ is given by the Baire distance

$$\delta_p(x, y) := \inf\{p^{-n} \mid \text{the first } n \text{ letters of } x \text{ and } y \text{ coincide}\}.$$
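The isometry between the Baire distance and the p-adic distance can be checked on finite strings. The sketch below is a simplification: it takes $f = 1$, uses the digits $\{1, \dots, p-1\}$ instead of roots of unity as letter codes (any system of representatives gives the same distances), and reserves 0 so that finite strings of different lengths stay distinct. The alphabet, the map `digit` and all function names are hypothetical.

```python
from fractions import Fraction


def encode(word: str, digit: dict, p: int) -> int:
    """Encode a finite string as the polynomial sum_i digit[word[i]] * p**i."""
    return sum(digit[ch] * p**i for i, ch in enumerate(word))


def baire_distance(x: str, y: str, p: int) -> Fraction:
    """p**(-n), with n the length of the longest common prefix; 0 if x == y."""
    if x == y:
        return Fraction(0)
    n = 0
    while n < min(len(x), len(y)) and x[n] == y[n]:
        n += 1
    return Fraction(1, p**n)


def padic_abs(n: int, p: int) -> Fraction:
    """p-adic absolute value of an integer."""
    if n == 0:
        return Fraction(0)
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return Fraction(1, p**v)


p = 5
digit = {"a": 1, "b": 2, "c": 3, "d": 4}  # injective, avoiding the digit 0
words = ["abca", "abd", "ab", "dab"]
for x in words:
    for y in words:
        lhs = baire_distance(x, y, p)
        rhs = padic_abs(encode(x, digit, p) - encode(y, digit, p), p)
        print(x, y, lhs, rhs)
```

In every printed row the two distances agree, which is the finite-string instance of the isometric embedding.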



Note that the image $\phi(S(\mathcal{A}))$ is a disc with holes coming from the complement of $\phi(\mathcal{A})$
in $\mathcal{R}_f$. More precisely, if $x \in O_K$ is represented as

$$x = \sum_\nu a_\nu p^\nu, \quad a_\nu \in \mathcal{R}_f,$$

the holes are given as the union of open discs

$$\{a_0 \notin \phi(\mathcal{A})\} \cup \{a_1 \notin \phi(\mathcal{A})\} \cup \{a_2 \notin \phi(\mathcal{A})\} \cup \dots$$

Note further that, although there are only finitely many encodings $\phi: \mathcal{A} \to \mathcal{R}_f$, there
are infinitely many p-adic encodings obtained by changing the system $\mathcal{R} \subset O_K$ of representatives
for the residue field $k$.
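For $f = 1$ the roots of unity in (2) already lie in $\mathbb{Z}_p$, so two different systems of representatives for $\mathbb{F}_p$ can be compared directly. The sketch below approximates the $(p-1)$-th roots of unity modulo $p^k$ by iterating $t \mapsto t^p$, a standard way of computing these so-called Teichmüller representatives; the function name `teichmuller` is our own.

```python
def teichmuller(a: int, p: int, k: int) -> int:
    """Representative modulo p**k of the unique root of x**p = x in Z_p
    that is congruent to a modulo p.

    Since a**(p**n) agrees with that root modulo p**(n + 1), it suffices to
    raise to the p-th power k - 1 times.
    """
    modulus = p**k
    t = a % modulus
    for _ in range(k - 1):
        t = pow(t, p, modulus)
    return t


p, k = 5, 4
usual = list(range(p))
roots = [teichmuller(a, p, k) for a in range(p)]
print(usual)   # the common choice {0, ..., p-1}
print(roots)   # 0 and the 4th roots of unity, modulo 5**4
```

Both lists reduce to the same residues modulo 5, so either can serve as the set of labels on the edges of the tree, while the resulting p-adic codes of the data differ.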

6 p-adic clustering 

If data $X$ are encoded p-adically, it is a very simple and fast task to retrieve the uniquely
determined hierarchical structure of $D$ given by the tree $\mathcal{T}(X)$. Any clustering algorithm
using the p-adic metric will never need to change the metric when measuring distances
between disjoint clusters $C_1$ and $C_2$, because of the fact

$$\mathrm{dist}_p(C_1, C_2) = |x - y|_p$$

for any $x \in C_1$, $y \in C_2$. Essentially, the fact that one seeks a subtree of a tree makes
things simpler and faster than in the archimedean situation. In [2, §3], an explicit
form of a p-adic hierarchic classification algorithm has been discussed. Benois-Pineau et
al. have applied such an algorithm in image segmentation [1].
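The independence of the cluster distance from the chosen representatives can be verified numerically. The following sketch is only an illustration of this fact and not the algorithm of [2, §3]; it uses two clusters taken from the 2-adic codes of Figure 1, namely the data lying in the disjoint discs $8\mathbb{Z}_2$ and $4 + 8\mathbb{Z}_2$, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def padic_abs(n: int, p: int) -> Fraction:
    """p-adic absolute value of an integer."""
    if n == 0:
        return Fraction(0)
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return Fraction(1, p**v)


def cluster_distances(c1, c2, p):
    """Set of all values |x - y|_p with x in c1 and y in c2."""
    return {padic_abs(x - y, p) for x, y in product(c1, c2)}


c1 = [0, 2**5, 2**6]                    # data in the disc 8 * Z_2
c2 = [2**2, 2**2 + 2**3, 2**2 + 2**4]   # data in the disc 4 + 8 * Z_2

print(cluster_distances(c1, c2, 2))                 # a single value
print(sorted({abs(x - y) for x in c1 for y in c2})) # many archimedean values
```

In the p-adic case single, complete and average linkage therefore coincide, whereas the archimedean distances between the same two clusters range over several values.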

7 DNA 

As an example for what has been said in the previous sections, we discuss p-adic encoding
of DNA. Here, the alphabet is given as $\mathcal{A} = \{A, G, C, T\}$, where

A = Adenine G = Guanine

C = Cytosine T = Thymine

Dragovich and Dragovich [6] choose a 5-adic encoding in the field $\mathbb{Q}_5$

$$\phi_{DD}: \mathcal{A} \to \mathcal{R} = \{0, 1, 2, 3, 4\}$$

with $\phi_{DD}(\mathcal{A}) = \{1, 2, 3, 4\}$. This allows for taking $0$ as a "blank" in order to separate
words made out of $\mathcal{A}$. So, in fact, they use the extended alphabet $\mathcal{A} \cup \{\text{"blank"}\}$ and
encode it with a bijection to $\mathcal{R}$ taking the "blank" to $0$.
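The role of the "blank" can be illustrated by encoding several words into a single 5-adic integer and reading them back. In the sketch below the assignment of nucleotides to the digits 1 to 4 is an arbitrary illustrative choice and is not claimed to be the one used in [6]; words are assumed to be non-empty, and the function names are hypothetical.

```python
# Illustrative assignment only; the digit 0 is reserved for the blank.
LETTERS = {"A": 1, "G": 2, "C": 3, "T": 4}
BLANK = 0


def encode_words(words, p=5):
    """Encode non-empty words, separated by blanks, as one p-adic integer."""
    digits = []
    for i, word in enumerate(words):
        if i > 0:
            digits.append(BLANK)
        digits.extend(LETTERS[ch] for ch in word)
    return sum(d * p**i for i, d in enumerate(digits))


def decode_words(code, p=5):
    """Recover the list of words from the p-adic digits of the code."""
    inverse = {d: ch for ch, d in LETTERS.items()}
    words, current = [], []
    while code:
        code, d = divmod(code, p)
        if d == BLANK:
            words.append("".join(current))
            current = []
        else:
            current.append(inverse[d])
    if current:
        words.append("".join(current))
    return words


code = encode_words(["GAT", "TACA"])
print(code, decode_words(code))
```

Without a blank, as in a bijection of the alphabet onto all available digits, word boundaries cannot be recovered from the code alone.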



Khrennikov and Kozyrev [9] use a bijection

$$\mathcal{A} \to \mathbb{F}_2^2$$

as their encoding. As an $\mathbb{F}_2$-vector space, $\mathbb{F}_2^2$ is isomorphic to the additive group of the
finite field $\mathbb{F}_{2^2}$ with four elements. This field, in turn, is the residue field of the cyclotomic
field $K = \mathbb{Q}_2(\zeta)$ with $\zeta$ a primitive third root of unity. Because of the correspondence

<~G). '+<-(!)■ 

their encoding can be interpreted as choosing $\mathcal{R}_{KK} = \{0, 1, \zeta, 1 + \zeta\}$ and a bijection

$$\phi_{KK}: \mathcal{A} \to \mathcal{R}_{KK},$$

which gives a 2-adic encoding. However, there is no "blank" in this case.
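A 2-adic encoding with coefficients in $\{0, 1, \zeta, 1 + \zeta\}$ can be sketched by storing an element $a + b\zeta$ of $\mathbb{Z}_2[\zeta]$ as the pair $(a, b)$. Since $\mathbb{Q}_2(\zeta)/\mathbb{Q}_2$ is unramified with integral basis $\{1, \zeta\}$, the valuation of $a + b\zeta$ is the minimum of the 2-adic valuations of $a$ and $b$. The particular assignment of nucleotides to coefficients below is an assumption made for illustration, not the bijection of [9].

```python
from fractions import Fraction

# Hypothetical assignment: the pair (a, b) stands for the coefficient a + b*zeta.
COEFF = {"A": (0, 0), "G": (1, 0), "C": (0, 1), "T": (1, 1)}


def encode(seq: str):
    """Encode a sequence as sum_i c_i * 2**i, stored as a pair (a, b) = a + b*zeta."""
    a = sum(COEFF[ch][0] << i for i, ch in enumerate(seq))
    b = sum(COEFF[ch][1] << i for i, ch in enumerate(seq))
    return a, b


def v2(n: int):
    """2-adic valuation of an integer; infinite for 0."""
    if n == 0:
        return float("inf")
    return (n & -n).bit_length() - 1


def distance(x, y) -> Fraction:
    """|x - y|_K for x, y in Z_2[zeta], given as pairs of integers."""
    v = min(v2(x[0] - y[0]), v2(x[1] - y[1]))
    return Fraction(0) if v == float("inf") else Fraction(1, 2**v)


print(encode("GCT"))
print(distance(encode("GCT"), encode("GCA")))  # first difference at position 2
print(distance(encode("GCT"), encode("GTT")))  # first difference at position 1
```

Because one letter is sent to the coefficient 0, trailing occurrences of that letter are invisible in the code; this is the absence of a "blank" noted above, and the sketch is therefore meant for sequences of equal length such as codons.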

By the previous sections, we see that there are a lot more possibilities, even for 2-adic
encodings. For a version without "blank", a bijection with

$$\mathcal{R}_2 = \{0, 1, \zeta, \zeta^2\}$$

could be used. And for a version with "blank", an injection into

$$\mathcal{R}_3 = \{0, 1, \zeta, \dots, \zeta^6\}$$

could be interesting, where $\zeta$ is a seventh root of unity.¹
The following questions come up naturally:

Questions 7.1 1. Are there among the possible 2-adic encodings

$$\mathcal{A} \to \mathcal{R}_3$$

some more preferred than others from the point of view of genomics?

2. Which are the best choices for systems $\mathcal{R} \subset O_K$ of representatives for the residue
field $k = \mathbb{F}_{2^3}$ from a genomic point of view (possibly including "blank")?

Of course, there is the question whether cyclotomic or unramified p-adic fields are
sufficiently suited for genomics.

¹ Notice that $7 = 2^3 - 1$.



8 Time series 

Assume that we are given some time dependent p-adically encoded data, i.e. a set of
p-adic numbers

$$S_t = \{s_0(t), \dots, s_n(t)\}$$

at some instances of time $t = 0, 1, \dots, N$. We assume that $\infty \in S_t$ and that there are no
"collisions" at any time, if we may use the language of "particles" moving inside some
"space". This corresponds to the p-adic projective line with $n + 1$ points removed:

$$X_t := \mathbb{P}^1 \setminus S_t,$$

which is the usual way of denoting an $(n+1)$-pointed genus zero curve. If we normalise
for each $t$ via some fractional linear map

$$z \mapsto \frac{az + b}{cz + d}$$

the punctures $S_t$ to contain $0, 1, \infty$, we have in $X_t$ a standard representative of a point
$x_t$ inside the moduli space $M_{0,n+1}$ of $(n+1)$-pointed genus zero curves defined over $\mathbb{Q}_p$. In
the language of moduli spaces, the time series $S_t$ corresponds to a family $X_t$ of punctured
curves which in turn comes from a map

$$\{0, 1, \dots, N\} \to M_{0,n+1}.$$

Collisions can also be treated in this way: simply replace $M_{0,n+1}$ by a suitable
compactification $\overline{M}_{0,n+1}$ in which the boundary corresponds to all possible ways of colliding
particles.
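The normalisation step can be sketched for rational data. Since $\infty \in S_t$ by assumption, it suffices to use the affine map $z \mapsto (z - a)/(b - a)$, i.e. the fractional linear map with $c = 0$, which fixes $\infty$ and sends two chosen punctures $a$ and $b$ to $0$ and $1$. The function name `normalise` and the sample time series are hypothetical.

```python
from fractions import Fraction


def normalise(finite_points, a, b):
    """Apply z -> (z - a) / (b - a) to the finite punctures.

    The map fixes infinity and sends a to 0 and b to 1, so the normalised
    configuration contains 0, 1 and infinity.
    """
    if a == b:
        raise ValueError("a and b must be distinct punctures")
    return [Fraction(z - a, b - a) for z in finite_points]


# A hypothetical time series: the finite punctures of S_t for t = 0, 1, 2.
series = [
    [3, 5, 11, 27],
    [3, 5, 11, 59],
    [3, 5, 11, 123],
]
for t, points in enumerate(series):
    print(t, normalise(points, points[0], points[1]))
```

After normalisation, only the remaining $n - 2$ finite punctures vary with $t$; they serve as coordinates of the point $x_t$ in the moduli space.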




Figure 3: Edge contraction.

There is now an infinite-to-one map

$$\Pi: M_{0,n+1} \to \mathcal{D}_n, \quad \mathbb{P}^1 \setminus S \mapsto \mathcal{T}(S)$$

into the space $\mathcal{D}_n$ of all dendrograms for $n$ data and $\infty$. The fibre of a point $x \in \mathcal{D}_n$
corresponds to the infinitely many possible p-adic encodings of the dendrogram associated
to $x$. Hence, these correspond to the sections $f: \mathcal{D}_n \to M_{0,n+1}$, i.e. maps satisfying

$$\Pi \circ f = \mathrm{id}_{\mathcal{D}_n}.$$

The space $\mathcal{D}_n$ is a polyhedral complex of dimension

$$\dim \mathcal{D}_n = \dim M_{0,n+1} = (n + 1) - 3,$$

where the subtraction of 3 comes from the normalisation after which 3 points are fixed.
The maximal cells of $\mathcal{D}_n$ are all of the dimension of the moduli space and consist of the
binary dendrograms. A cell in $\mathcal{D}_n$ is characterised by the fact that the abstract trees
corresponding to its elements are all isomorphic, whereas the edge lengths vary. Passing
to a neighbouring cell amounts to contracting an edge as illustrated in Figure 3.

9 Genus 1 time series 



Figure 4: A sequence of dendrograms.



Consider the sequence of dendrograms as given in Figure 4. We can view this as a
vertex $v_t$ determined by $x$ at time $t$ "jumping" along the geodesic line between $0$ and $1$
with respect to the fixed vertex determined by $\infty$, as in Figure 5.

If the vertex jumps at a constant rate, i.e. the distance at each time is the same, then
we can model this by a translation along that geodesic line, or p-adically via a Möbius
transformation $\gamma$ depending on a parameter $c$ with $|c| < 1$. This corresponds for $v_t$ to a
jump of distance $-\log_p |c|$ to the right. This is the case of $\gamma$ being hyperbolic.

Figure 5: Vertex jump.

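By a change of coordinates taking $(0, 1, \infty)$ to $(0, \infty, 1)$ we transform everything said
above to a hyperbolic action of the cyclic group $\langle\gamma\rangle$ on the geodesic line between $0$ and
$\infty$, or p-adically: on $K^\times = \mathbb{P}^1 \setminus \{0, \infty\}$. Hence, $\gamma$ has now the form

$$\gamma: z \mapsto c \cdot z,$$

and we obtain the commutative diagram

$$\begin{array}{ccc}
K^\times & \longrightarrow & K^\times/\langle\gamma\rangle = E \\
\downarrow & & \downarrow \\
\text{geodesic line between } 0 \text{ and } \infty & \longrightarrow & \text{loop}
\end{array}$$

in which $E$ is a so-called Tate elliptic curve. It is a p-adic curve of genus 1, and the
vertical arrows are the so-called reduction or tropicalisation maps.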
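The effect of the action $z \mapsto c \cdot z$ on the tree can be sketched through valuations: the vertex on the geodesic line between $0$ and $\infty$ nearest to a point $z$ is the disc of radius $|z|$ around $0$, so it sits at level $v_p(z)$, and each application of $\gamma$ shifts this level by $v_p(c) = -\log_p |c|$. The helper names below are our own, and rational numbers stand in for general p-adic numbers.

```python
from fractions import Fraction


def valuation(x, p: int) -> int:
    """p-adic valuation of a non-zero rational number."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("the valuation of 0 is infinite")
    v, num, den = 0, x.numerator, x.denominator
    while num % p == 0:
        num //= p
        v += 1
    while den % p == 0:
        den //= p
        v -= 1
    return v


def orbit_levels(z, c, p: int, steps: int):
    """Levels v_p(c**t * z) of the vertices visited by the orbit of z."""
    return [valuation(Fraction(c) ** t * Fraction(z), p) for t in range(steps)]


p, c = 3, 9                     # |c|_3 = 1/9 < 1, so gamma is hyperbolic
print(orbit_levels(2, c, p, 4))
print(valuation(c, p))          # the jump distance -log_p |c|
```

Identifying $z$ with $c \cdot z$ glues the geodesic line into a loop whose length is the jump distance, which is the bottom row of the diagram.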

The above example generalises to the case of a discrete action on the p-adic projective
line $\mathbb{P}^1$ of a group $G$ of fractional linear transformations inside $\mathrm{PGL}_2(K)$. If, in this case,
$\Omega \subset \mathbb{P}^1$ is the domain on which $G$ acts without limit or fixed points, then $C = \Omega/G$ is
known to be a so-called Mumford curve, a p-adic analogue of a Riemann surface.



10 Identifying p-adic Riemann surfaces 

Mumford curves are considered for p-adic higher genus string amplitudes in [5], where the
authors call them p-adic Riemann surfaces. Unlike in the classical case, not all algebraic
curves defined over $\mathbb{Q}_p$ are p-adic Riemann surfaces. However, Chekhov et al. conjecture
that the other curves do not contribute to the p-adic (or adelic) string amplitude [5,
Conjecture §4.3]. This brings another physical motivation to the general problem of
recognising Mumford curves among algebraic curves.

Notice that the loop in the commutative diagram of the preceding section is of length
$-\log_p |c|$ and hence shrinks to zero if $|c|$ approaches unity. In this case, the fractional
linear transformation $\gamma$ is not hyperbolic, and there is no longer a discrete action of $\langle\gamma\rangle$
on the geodesic.

On the side of elliptic curves, this corresponds to the fact that the family of elliptic
curves parametrised by $\gamma$ converges to a p-adic elliptic curve which is not a Tate curve.
Such curves do exist, and they can be distinguished by their j-invariant.

In fact, let the elliptic curve $E$ be given by an equation over $K$ in Legendre normal
form

$$E: y^2 = x(x - 1)(x - \lambda),$$

where we may assume that $|\lambda| = 1$ (this implies also $|\lambda - 1| \le 1$). Then, if $K$ is a
sufficiently large finite extension of $\mathbb{Q}_p$, it holds true by [H Ex. 3.8] that

E is a Tate curve \j(E)\ > |2|* 
^|A-1|< \2\l 

This result was already known for the case $p > 2$, in which $|2|_p = 1$ [13, Thm. 5]. The
last equivalence follows from a well-known formula relating $\lambda$ and the j-invariant. In
order to show that the first and third statements are equivalent, one can consider the
cover $\phi: E \to \mathbb{P}^1$ of degree 2 defined by the Legendre equation: $\phi$ is simply projection
onto the $x$-coordinate. This induces a cover of degree 2 of the tree $\mathcal{T}(\{0, 1, \infty, \lambda\})$ as
depicted in Figure 6. The proof then consists of calculating the infimum of $\ell$ for which
the upper graph still represents a Tate curve [4]. For $p = 2$, the intuition $\inf(\ell) = 0$ fails
because of too many fixed vertices of the elliptic involution on the Bruhat-Tits tree.
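The relation between $\lambda$ and the j-invariant can be explored numerically. The sketch below assumes the standard formula $j = 2^8 (\lambda^2 - \lambda + 1)^3 / (\lambda^2 (\lambda - 1)^2)$ for a curve in Legendre form, which is not quoted from the references of this article, and computes p-adic valuations for a few rational sample values of $\lambda$ with $|\lambda| = 1$ and $|\lambda - 1| < 1$. In these samples one finds $v_p(j) = 8\,v_p(2) - 2\,v_p(\lambda - 1)$, so the factor $2^8$ only matters for $p = 2$.

```python
from fractions import Fraction


def j_invariant(lam) -> Fraction:
    """j-invariant of y**2 = x(x - 1)(x - lam), by the standard Legendre formula."""
    lam = Fraction(lam)
    return 256 * (lam**2 - lam + 1) ** 3 / (lam**2 * (lam - 1) ** 2)


def valuation(x, p: int) -> int:
    """p-adic valuation of a non-zero rational number."""
    x = Fraction(x)
    v, num, den = 0, x.numerator, x.denominator
    while num % p == 0:
        num //= p
        v += 1
    while den % p == 0:
        den //= p
        v -= 1
    return v


for p, lam in [(3, 1 + 3**2), (2, 1 + 2**5), (2, 1 + 2**3)]:
    print(p, lam, valuation(lam - 1, p), valuation(j_invariant(lam), p))
```

For $p = 3$ any $\lambda$ with $|\lambda - 1| < 1$ already gives a j-invariant of negative valuation, whereas for $p = 2$ the sample $\lambda = 1 + 2^3$ yields a j-invariant of positive valuation even though $|\lambda - 1| < 1$.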




Figure 6: Tate curve covering $\mathbb{P}^1$, graphically.

In the case of general Mumford curves (or p-adic Riemann surfaces, if one wishes), one
can study the cover $\Omega \to \Omega/G = C$ in a similar combinatorial way. Here, the so-called
Hurwitz spaces, which are moduli spaces for covers between curves, come into play. It
turns out that the question whether the upper curve in a cover $f: X \to Y$ is a Mumford
curve is subtle. Only a restricted type of covers $f$ can in principle allow $X$ to be a
Mumford curve, and even then the answer depends on the position of the branch points
of the covering map $f$ [21 E] .

11 Conclusion 

A p-adic encoding of hierarchical data has been discussed from a geometric point of
view. Any dendrogram can in this way be viewed as a subtree of the Bruhat-Tits tree for
$\mathrm{PGL}_2(K)$ defined over a p-adic field $K$ large enough to capture the maximal number of
children vertices in the dendrogram. This is possible without changing the prime number
$p$. The philosophical result is that cluster analysis becomes the finding of a suitable p-adic
encoding of data, because then the dendrogram is uniquely determined by the ultrametric
geometry. As an example, strings over a finite alphabet have been considered, where the
p-adic distance coincides with the Baire distance. Application to encoding of DNA has
been discussed, where the general question is raised which arithmetic conditions on a
2-adic field $K$ must be imposed from the point of view of genomics.

A consideration of time series of hierarchical data leads to families of dendrograms or
n-pointed p-adic projective lines and their moduli spaces as a natural geometric framework.
Higher genus p-adic algebraic curves enter the scene if a time series can be
modelled via a discrete action of fractional linear transformations on the p-adic Riemann
sphere. This and p-adic multiloop calculations in string theory [5] motivate the question
of how to decide whether a given algebraic curve of higher genus is a p-adic Riemann
surface.

It is the hope that methods from p-adic string theory and enumerative geometry will 
eventually find their way into hierarchical data analysis. 